feat(BA-2851): Add resource isolation options for multi-agent #6498
Conversation
Pull Request Overview
This PR introduces resource isolation options for multi-agent setups, enabling multiple agents to run on the same physical host with controlled resource allocation. The implementation adds three allocation modes: SHARED (default, backward compatible), AUTO_SPLIT (automatic equal division), and MANUAL (explicit per-agent configuration).
Key changes:
- Introduces a `ResourcePartitioner` class to manage resource allocation across agents
- Adds a `ResourceAllocationMode` enum with SHARED, AUTO_SPLIT, and MANUAL modes (see the sketch after this list)
- Implements validation logic to ensure consistent manual allocations across agents
- Updates agent initialization to use resource partitioning
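For concreteness, here is a minimal sketch of what the three modes mean for per-slot amounts. The enum values and the `auto_split` helper below are illustrative assumptions based on the PR description, not the PR's actual API:

```python
import enum
from decimal import Decimal
from typing import Mapping


class ResourceAllocationMode(enum.Enum):
    # Mode names follow the PR description; the string values are assumed.
    SHARED = "shared"          # every agent sees the full resource pool (previous behavior)
    AUTO_SPLIT = "auto-split"  # resources are divided equally among the agents on the host
    MANUAL = "manual"          # explicit per-agent allocations taken from the config


def auto_split(total_slots: Mapping[str, Decimal], num_agents: int) -> dict[str, Decimal]:
    """Illustrative AUTO_SPLIT behavior: give each agent an equal share of every slot."""
    return {name: amount / num_agents for name, amount in total_slots.items()}


# Two agents on a 64-core, 128 GiB host would each see half of every slot.
print(auto_split({"cpu": Decimal(64), "mem": Decimal(128)}, num_agents=2))
# -> {'cpu': Decimal('32'), 'mem': Decimal('64')}
```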
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 23 comments.
| File | Description |
|---|---|
| src/ai/backend/agent/resources.py | Adds ResourcePartitioner class and changes abstract methods to raise NotImplementedError |
| src/ai/backend/agent/config/unified.py | Defines allocation modes, new config fields (allocated_cpu/mem/disk/devices), and validation logic; see the config sketch after this table |
| src/ai/backend/agent/agent.py | Integrates ResourcePartitioner into agent initialization and updates slot calculations |
| src/ai/backend/agent/server.py | Creates ResourcePartitioner instances per agent and adds resource reconciliation |
| src/ai/backend/agent/docker/agent.py | Adds resource_partitioner parameter to constructor |
| src/ai/backend/agent/kubernetes/agent.py | Adds resource_partitioner parameter to constructor |
| tests/agent/test_resource_allocation.py | Comprehensive unit tests for all three allocation modes |
| tests/agent/test_config_validation.py | Tests for config validation of allocation modes and device consistency |
| tests/agent/docker/test_agent.py | Updates test to pass ResourcePartitioner to agent |
| changes/6498.feature.md | Changelog entry |
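To make the `config/unified.py` row concrete, here is a rough Pydantic sketch of how the new fields and the MANUAL-mode validation could look. The field names come from the summary above; the types, defaults, and exact validation rules are assumptions rather than the PR's actual schema:

```python
from typing import Literal, Optional

from pydantic import BaseModel, model_validator


class AgentResourceConfig(BaseModel):
    # Field names taken from the per-file summary; types and defaults are assumed.
    allocation_mode: Literal["shared", "auto-split", "manual"] = "shared"
    allocated_cpu: Optional[int] = None            # CPU cores pinned to this agent
    allocated_mem: Optional[str] = None            # e.g. "32g"
    allocated_disk: Optional[str] = None           # e.g. "500g"
    allocated_devices: Optional[list[str]] = None  # device IDs assigned to this agent

    @model_validator(mode="after")
    def _check_manual_allocations(self) -> "AgentResourceConfig":
        # MANUAL mode must spell out its allocations; the other modes must not set them.
        manual_fields = (self.allocated_cpu, self.allocated_mem, self.allocated_disk)
        if self.allocation_mode == "manual":
            if any(v is None for v in manual_fields):
                raise ValueError("manual mode requires allocated_cpu/mem/disk to be set")
        elif any(v is not None for v in manual_fields) or self.allocated_devices is not None:
            raise ValueError("allocated_* fields are only valid in manual mode")
        return self
```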
```python
class SlotName(UserString):
    __slots__ = ("_parsed", "_device_name", "_major_type", "_minor_type")
    __match_args__ = ("device_name", "major_type", "minor_type")
```
Why was this added?
It's not strictly necessary, but I wanted to use pattern matching here (https://github.com/lablup/backend.ai/pull/6498/files#diff-a4da2a344d73525736025bcd638112245de4a7225d6293d21a7e1e5152224ec8R675), and this class needs `__match_args__` for the match statement to work.
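For context on why `__match_args__` matters: it is what allows a class pattern with positional sub-patterns to bind attributes in a `match` statement. A toy example (a simplified stand-in, not the real `SlotName`):

```python
class Slot:
    # __match_args__ maps positional sub-patterns to these attribute names.
    __match_args__ = ("device_name", "major_type", "minor_type")

    def __init__(self, device_name: str, major_type: str, minor_type: str) -> None:
        self.device_name = device_name
        self.major_type = major_type
        self.minor_type = minor_type


def describe(slot: Slot) -> str:
    match slot:
        case Slot("cuda", _, "shares"):  # positional patterns bind via __match_args__
            return "fractional GPU slot"
        case Slot(device_name, major_type, _):
            return f"{device_name} slot of type {major_type}"
    return "unknown"


print(describe(Slot("cuda", "count", "shares")))  # fractional GPU slot
print(describe(Slot("cpu", "count", "")))         # cpu slot of type count
```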
```python
async def _load_resources(self) -> Mapping[DeviceName, AbstractComputePlugin]:
    local_config_dump = self.local_config.model_dump(by_alias=True)

    match self.local_config.agent_common.backend:
        case AgentBackend.DOCKER:
            from .docker.resources import load_resources as docker_load

            return await docker_load(self.etcd, local_config_dump)
        case AgentBackend.KUBERNETES:
            from .kubernetes.resources import load_resources as kubernetes_load

            return await kubernetes_load(self.etcd, local_config_dump)
        case AgentBackend.DUMMY:
            from .dummy.config import DEFAULT_CONFIG_PATH, dummy_local_config
            from .dummy.resources import load_resources as dummy_load

            raw_config, _ = read_from_file(DEFAULT_CONFIG_PATH, "dummy")
            dummy_config = dummy_local_config.check(raw_config)
            return await dummy_load(self.etcd, local_config_dump, dummy_config)


async def _scan_available_resources(self) -> Mapping[SlotName, Decimal]:
    compute_device_types = {name: cctx.instance for name, cctx in self.computers.items()}

    match self.local_config.agent_common.backend:
        case AgentBackend.DOCKER:
            from .docker.resources import scan_available_resources as docker_scan

            return await docker_scan(compute_device_types)
        case AgentBackend.KUBERNETES:
            from .kubernetes.resources import scan_available_resources as kubernetes_scan

            return await kubernetes_scan(compute_device_types)
        case AgentBackend.DUMMY:
            from .dummy.resources import scan_available_resources as dummy_scan

            return await dummy_scan(compute_device_types)
```
In terms of extensibility, this change seems like a regression and doesn't look good.
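One way to read the extensibility concern: a hard-coded `match` over `AgentBackend` has to be edited for every new backend, whereas a registry keyed by backend name lets new backends register their own loaders. The sketch below only illustrates that alternative; it is not something proposed in the PR, and all names in it are made up:

```python
# Illustrative registry-style dispatch; the PR itself uses a match statement instead.
from typing import Awaitable, Callable, Mapping

LoaderFn = Callable[..., Awaitable[Mapping]]
_RESOURCE_LOADERS: dict[str, LoaderFn] = {}


def register_loader(backend: str) -> Callable[[LoaderFn], LoaderFn]:
    def deco(fn: LoaderFn) -> LoaderFn:
        _RESOURCE_LOADERS[backend] = fn
        return fn
    return deco


@register_loader("docker")
async def _docker_load(etcd: object, local_config: Mapping) -> Mapping:
    ...  # would delegate to .docker.resources.load_resources in real code


async def load_resources(backend: str, etcd: object, local_config: Mapping) -> Mapping:
    try:
        return await _RESOURCE_LOADERS[backend](etcd, local_config)
    except KeyError:
        raise ValueError(f"no resource loader registered for backend {backend!r}") from None
```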
This change implements configuration for partitioning resources. SHARED mode lets all agents see the full resource pool (useful for stress testing), which matches the previous behavior. AUTO_SPLIT automatically divides resources equally among agents. MANUAL mode lets users specify exact per-agent allocations for all resources. Single-agent deployments remain unaffected and retain access to all available hardware resources.
This change modifies the semantics of ResourcePartitioner so that it now takes ownership of the devices and injects partitioned devices into individual agents after initialization.
This change fixes a bug in resource splitting where reserved resources were accidentally included in each agent's total allocation. The total-slot handling was malformed: from the perspective of a single agent, the calculation of reserved resources did not properly account for the server's reserved resources. The fix inverts the condition so that reserved resources are deducted only where needed.
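With assumed numbers (not taken from the PR), the effect of the fix on an equal split looks like this:

```python
from decimal import Decimal

total_cpu = Decimal(64)    # host cores (illustrative)
reserved_cpu = Decimal(2)  # cores reserved for the server itself
num_agents = 2

buggy_share = total_cpu / num_agents                   # 32: the reserved cores get handed out too
fixed_share = (total_cpu - reserved_cpu) / num_agents  # 31: reservation deducted once, up front

print(buggy_share, fixed_share)  # 32 31
```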
Will create a new PR
resolves #6432 (BA-2851)
This change adds configuration for partitioning resources rather than every agent always seeing the full resource pool. This prevents unintended over-allocation that could crash kernels.
Single-agent deployments remain unaffected and retain access to all available hardware resources.
Checklist: (if applicable)
- Updates to integration tests in `ai.backend.test`
- Documentation contents in the `docs` directory